Towards Data-Algorithm Dependent Generalization: a Case Study on Overparameterized Linear Regression

Neural Information Processing Systems

One of the major open problems in machine learning is to characterize generalization in the overparameterized regime, where most traditional generalization bounds become inconsistent even for overparameterized linear regression. In many scenarios, this failure can be attributed to bounds that obscure the crucial interplay between the training algorithm and the underlying data distribution. This paper demonstrates that the generalization behavior of overparameterized models should be analyzed in a manner that is both data-relevant and algorithm-relevant. To make a formal characterization, we introduce a notion called data-algorithm compatibility, which considers the generalization behavior of the entire data-dependent training trajectory instead of the traditional last-iterate analysis.


On the Optimal Weighted \ell_2 Regularization in Overparameterized Linear Regression

Neural Information Processing Systems

Our general setup leads to a number of interesting findings. We outline precise conditions that decide the sign of the optimal setting $\lambda_{\opt}$ for the ridge parameter $\lambda$ and confirm the implicit $\ell_2$ regularization effect of overparameterization, which theoretically justifies the surprising empirical observation that $\lambda_{\opt}$ can be \textit{negative} in the overparameterized regime. We also characterize the double descent phenomenon for principal component regression (PCR) when $\vX$ and $\vbeta_{\star}$ are both anisotropic. Finally, we determine the optimal weighting matrix $\vSigma_w$ for both the ridgeless ($\lambda\to 0$) and optimally regularized ($\lambda = \lambda_{\opt}$) case, and demonstrate the advantage of the weighted objective over standard ridge regression and PCR.
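The weighted objective described above minimizes $\|y - X\beta\|^2 + \lambda\, \beta^\top \Sigma_w \beta$, whose minimizer has the standard closed form $(X^\top X + \lambda \Sigma_w)^{-1} X^\top y$, with ordinary ridge recovered at $\Sigma_w = I$. A minimal numerical sketch of this estimator in an overparameterized setting; the anisotropic covariance, noise level, and weighting matrix below are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200  # overparameterized: p > n

# Illustrative anisotropic feature covariance and signal (assumed, not the paper's)
cov = np.diag(1.0 / np.arange(1, p + 1))
X = rng.standard_normal((n, p)) @ np.sqrt(cov)
beta_star = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta_star + 0.1 * rng.standard_normal(n)

def weighted_ridge(X, y, lam, Sigma_w):
    """Closed-form minimizer of ||y - X b||^2 + lam * b^T Sigma_w b."""
    return np.linalg.solve(X.T @ X + lam * Sigma_w, X.T @ y)

def excess_risk(b):
    """Population risk E[(x^T (b - beta_star))^2] under the feature covariance."""
    d = b - beta_star
    return float(d @ (cov @ d))

# Standard ridge is the special case Sigma_w = I
r_ridge = excess_risk(weighted_ridge(X, y, 1.0, np.eye(p)))
# One possible weighting: penalize low-variance directions more heavily
r_weighted = excess_risk(weighted_ridge(X, y, 1.0, np.diag(np.arange(1, p + 1) / p)))
```

Sweeping `lam` over a grid (including small negative values, while the matrix stays invertible) is one way to probe the sign of the optimal regularization empirically.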




Review for NeurIPS paper: On the Optimal Weighted \ell_2 Regularization in Overparameterized Linear Regression

Neural Information Processing Systems

Weaknesses: The main issue I have with the paper is about the novelty of the results. The authors mention that previous work on linear regression is not as general as the current work; in particular, prior papers allow only isotropic features or an isotropic signal. This paper, which was arXived about a month before the NeurIPS deadline, seems to do both: [1] Emami, Melikasadat, et al. "Generalization error of generalized linear models in high dimensions." The results of that paper allow one to characterize the exact generalization error in the same asymptotic limit for Gaussian data with general covariance and any regularization, which includes the \ell_2-type regularizations considered here, as well as more general regularizations like general \ell_p norms. Here is my understanding of the differences between the results of the two papers: - In [1] the authors allow for Gaussian features with any covariance matrix, whereas your paper allows non-Gaussian features so long as they have a bounded 12th centered moment.


Review for NeurIPS paper: On the Optimal Weighted \ell_2 Regularization in Overparameterized Linear Regression

Neural Information Processing Systems

The paper received three positive reviews. Most of the minor concerns raised in the initial reviews have been addressed in the rebuttal. The area chair agrees with the reviewers' assessment and follows their recommendation.


Towards Data-Algorithm Dependent Generalization: a Case Study on Overparameterized Linear Regression

Neural Information Processing Systems

One of the major open problems in machine learning is to characterize generalization in the overparameterized regime, where most traditional generalization bounds become inconsistent even for overparameterized linear regression. In many scenarios, this failure can be attributed to bounds that obscure the crucial interplay between the training algorithm and the underlying data distribution. This paper demonstrates that the generalization behavior of overparameterized models should be analyzed in a manner that is both data-relevant and algorithm-relevant. To make a formal characterization, we introduce a notion called data-algorithm compatibility, which considers the generalization behavior of the entire data-dependent training trajectory instead of the traditional last-iterate analysis. Specifically, we perform a data-dependent trajectory analysis and derive a sufficient condition for compatibility in such a setting.
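As an illustration of what a trajectory-level (rather than last-iterate) view looks like, the sketch below runs plain gradient descent on an overparameterized linear regression problem and records the population excess risk at every iterate; the data model, noise level, and step size are illustrative assumptions, not the paper's analysis:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 160  # overparameterized linear regression

# Illustrative anisotropic features with signal in the top directions (assumed)
var = 1.0 / np.arange(1, p + 1)
X = rng.standard_normal((n, p)) * np.sqrt(var)
beta_star = np.zeros(p)
beta_star[:5] = 1.0
y = X @ beta_star + 0.5 * rng.standard_normal(n)

def excess_risk(b):
    """Population risk E[(x^T (b - beta_star))^2] under the diagonal covariance."""
    d = b - beta_star
    return float(d @ (var * d))

# Gradient descent from zero; track the risk along the entire trajectory,
# not only at the final iterate.
b = np.zeros(p)
lr = 0.5 / np.linalg.norm(X, 2) ** 2  # step size below the stability threshold
risks = []
for t in range(500):
    b += lr * X.T @ (y - X @ b)  # gradient step on 0.5 * ||y - X b||^2
    risks.append(excess_risk(b))

best_t = int(np.argmin(risks))  # the best iterate may precede the last one
```

Comparing `risks[best_t]` with `risks[-1]` shows why analyzing the whole data-dependent trajectory can give guarantees that a last-iterate analysis misses.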


On the Optimal Weighted \ell_2 Regularization in Overparameterized Linear Regression

Neural Information Processing Systems

Our general setup leads to a number of interesting findings. We outline precise conditions that decide the sign of the optimal setting $\lambda_{\opt}$ for the ridge parameter $\lambda$ and confirm the implicit $\ell_2$ regularization effect of overparameterization, which theoretically justifies the surprising empirical observation that $\lambda_{\opt}$ can be \textit{negative} in the overparameterized regime. We also characterize the double descent phenomenon for principal component regression (PCR) when $\vX$ and $\vbeta_{\star}$ are both anisotropic. Finally, we determine the optimal weighting matrix $\vSigma_w$ for both the ridgeless ($\lambda\to 0$) and optimally regularized ($\lambda = \lambda_{\opt}$) case, and demonstrate the advantage of the weighted objective over standard ridge regression and PCR.